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Ph In the context of nonparametric Bayesian estimation a Markov chain Monte Carlo algo- 

rithm is devised and implemented to sample from the posterior distribution of the drift 
function of a continuously or discretely observed one-dimensional diffusion. The drift is 
modeled by a scaled linear combination of basis functions with a Gaussian prior on the co- 
efficients. The scaling parameter is equipped with a partially conjugate prior. The number 
of basis function in the drift is equipped with a prior distribution as well. For continuous 
data, a reversible jump Markov chain algorithm enables the exploration of the posterior 
over models of varying dimension. Subsequently, it is explained how data- augment at ion 
can be used to extend the algorithm to deal with diffusions observed discretely in time. 
Some examples illustrate that the method can give satisfactory results. In these examples 

q a comparison is made with another existing method as well. 



X 



1. Introduction 



ON 

Suppose we observe a diffusion process X, given as the solution of the stochastic dif- 
_^ ferential equation (SDE) 



dX t = b(X t )dt+ dW t , X = x , (1) 



5^ 

with initial state Xq and unknown drift function b. The aim is to estimate the drift b when 
a sample path of the diffusion is observed continuously up till a time T > or at discrete 
times 0, A, 2A, . . . , nA, for some A > and n G N. 

Diffusion models are widely employed in a variety of scientific fields, including physics, 
economics and biology. Developing methodology for fitting SDEs to observed data has 
therefore become an important problem. In this paper we restrict the exposition to the 
case that the drift function is 1-periodic and the diffusion function is identically equal to 
1. This is motivated by applications in which the data consists of recordings of angles, 
cf. e.g. Pokern (2007), Hindriks (2011) or Papaspiliopoulos et al. (2012). The methods we 
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propose can however be adapted to work in more general setups, such as ergodic diffusions 
with non-unit diffusion coefficients. In the continuous observations case, a diffusion with 
periodic drift could alternatively be viewed as diffusion on the circle. Given only discrete 
observations on the circle, the information about how many turns around the circle the 
process has made between the observations is lost however and the total number of windings 
is unknown. For a lean exposition we concentrate therefore on diffusions with periodic 
drift on M. In the discrete observations setting the true circle case could be treated by 
introducing a latent variable that keeps track of the winding number. 

In this paper we propose a new approach to making nonparametric Bayesian inference 
for the model (1). A Bayesian method can be attractive since it does not only yield an 
estimator for the unknown drift function, but also gives a quantification of the associated 
uncertainty through the spread of the posterior distribution, visualized for instance by 
pointwise credible intervals. Until now the development of Bayesian methods for diffu- 
sions has largely focussed on parametric models. In such models it is assumed that the 
drift is known up to a finite-dimensional parameter and the problem reduces to making 
inference about that parameter. See for instance the papers Eraker (2001), Roberts and 
Stramer (2001), Beskos et al. (2006a), to mention but a few. When no obvious parametric 
specification of the drift function is available it is sensible to explore nonparametric esti- 
mation methods, in order to reduce the risk of model misspecification or to validate certain 
parametric specifications. The literature on nonparametric Bayesian methods for SDEs is 
however still very limited at the present time. The only paper which proposes a practical 
method we are aware of is Papaspiliopoulos et al. (2012). The theoretical, asymptotic 
behavior of the procedure of Papaspiliopoulos et al. (2012) is studied in the recent paper 
Pokern et al. (2013). Other papers dealing with asymptotics in this framework include 
Panzar and van Zanten (2009) and Van der Meulen and van Zanten (2013), but these do 
not propose practical computational methods. 

The approach we develop in this paper extends or modifies that of Papaspiliopoulos 
et al. (2012) in a number of directions and employs different numerical methods. Pa- 
paspiliopoulos et al. (2012) consider a Gaussian prior distribution on the periodic drift 
function b. This prior is defined as a Gaussian distribution on L 2 [0, 1] with densely defined 
inverse covariance operator (precision operator) 

V ((-Ay + Ki), (2) 

where A is the one-dimensional Laplacian (with periodic boundaries conditions), I is the 
identity operator and i], k > and p 6 N are fixed hyper parameters. It is asserted in 
Papaspiliopoulos et al. (2012) and proved in Pokern et al. (2013) that if the diffusion is 
observed continuously, then for this prior the posterior mean can be characterized as the 
weak solution of a certain differential equation involving the local time of the diffusion. 
Moreover, the posterior precision operator can be explicitly expressed as a differential 
operator as well. Posterior computations can then be done using numerical methods for 
differential equations. 

To explain our alternative approach we note, as in Pokern et al. (2013), that the prior 
just defined can be described equivalently in terms of series expansions. Define the basis 
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functions ipk G L 2 [0, 1] by setting ipi = 1, and for k e N ^2fc(^) = \/2 sin(2A;7rx) and 
ip2k+i(x) = y/2cos(2knx). Then the prior is the law of the random function 



oo 



x i — y 



2j y/XiZiij}i(x) 



1=1 



where the Z\ are independent, standard normal variables and for / > 2 




) 



-1 



(3) 



This characterization shows in particular that the hyper parameter p describes the regu- 
larity of the prior through the decay of the Fourier coefficients and I/77 is a multiplicative 
scaling parameter. The priors we consider in this paper are also defined via series expan- 
sions. However, we make a number of substantial changes. 

Firstly, we allow for different types of basis functions. Different basis functions instead 
of the Fourier-type functions may be computationally attractive. The posterior computa- 
tions involve the inversion of certain large matrices and choosing basis functions with local 
support typically makes these matrices sparse. In the general exposition we keep the basis 
functions completely general but in the simulation results we will consider wavelet-type 
Faber-Schauder functions in addition to the Fourier basis. A second difference is that we 
truncate the infinite series at a level that we endow with a prior as well. In this manner 
we can achieve considerable computational gains if the data driven truncation point is 
relatively small, so that only low- dimensional models are used and hence only relatively 
small matrices have to be inverted. A last important change is that we do not set the 
multiplicative hyper parameter at a fixed value, but instead endow it with a prior and let 
the data determine the appropriate value. 

We will present simulation results in Section 4 which illustrate that our approach indeed 
has several advantages. Although the truncation of the series at a data driven point involves 
incorporating reversible jump MCMC steps in our computational algorithm, we will show 
that it can indeed lead to a considerably faster procedure compared to truncating at some 
fixed high level. The introduction of a prior on the multiplicative hyper parameter reduces 
the risk of misspecifying the scale of the drift. We will show in Section 4.2 that using a 
fixed scaling parameter can seriously deteriorate the quality of the inference, whereas our 
hierarchical procedure with a prior on that parameter is able to adapt to the true scale of 
the drift. A last advantage that we will illustrate numerically is that by introducing both 
a prior on the scale and on the truncation level we can achieve some degree of adaptation 
to smoothness as well. 

Computationally we use a combination of methods that are well established in other 
statistical settings. Within models in which the truncation point of the series is fixed we 
use Gibbs sampling based on standard inverse gamma-normal computations. We combine 
this with reversible jump MCMC to move between different models. For these moves, we 
can use an auxiliary Markov chain to propose a model, and subsequently draw coefficients 
from their posterior distribution within that model. Such a scheme has been proposed for 
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example in Godsill (2001) for estimation in autoregressive time-series models. In case of 
discrete observations we also incorporate a data augmentation step using a Metropolis- 
Hastings sampler to generate diffusion bridges. Our numerical examples illustrate that 
using our algorithm it is computationally feasible to carry out nonparametric Bayesian 
inference for low-frequency diffusion data using a non-Gaussian hierarchical prior which is 
more flexible than previous methods. 

A brief outline of the article is as follows: In Section 2 we give a concise prior specifi- 
cation. In the section thereafter, we present the reversible jump algorithm to draw from 
the posterior for continuous-time data. Data-augmentation is discussed in Section 3.3. 
In Section 4 we give some examples to illustrate our method. We end with a section on 
numerical details. 

2. Prior distribution 

2.1. General prior specification 

To define our prior on the periodic drift function b we write a truncated series expansion 
for b and put prior weights on the truncation point and on the coefficients in the expansion. 
We employ general 1-periodic, continuous basis functions 4 I E N. In the concrete 
examples ahead we will consider in particular Fourier and Faber-Schauder functions. We 
fix an increasing sequence of natural numbers rrij, j G N, to group the basis functions 
into levels. The functions ipi, . . . ,ip mi constitute level 1, the functions VVm+ij • • • > VVj 
correspond to level 2, etcetera. In this manner we can accommodate both families of basis 
functions with a single index (e.g. the Fourier basis) and doubly indexed families (e.g. 
wavelet-type bases) in our framework. Functions that are linear combinations of the first 
rrij basis functions ipi, . . . , ip m . are said to belong to model j. Model j encompasses levels 
1 up till j. 

To define the prior on b we first put a prior on the model index j, given by certain prior 
weights p(j), j G N. By construction, a function in model j can be expanded as Xw=i ^i^i 
for a certain vector of coefficients 9^ G K mj . Given j, we endow this vector with a prior by 
postulating that the coefficients 9j are given by an inverse gamma scaling constant times 
independent, centered Gaussians with decreasing variances I G N. The choice of the 
constants £f is discussed in Sections 2.2.1 and 2.2.2. 

Concretely, to define the prior we fix model probabilities p(j), j G N, decreasing vari- 
ances £f, positive constants a, b > and set S J = diag(£ 2 , . . . ,^ n .)- Then the hierarchical 
prior II on the drift function b is defined as follows: 
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j ~p(i), 

s 2 ~ IG(a,6), 

b\j,s 2 ,e^J2oi^- 
1=1 



2. 2. Specific basis functions 

Our general setup is chosen such that we can incorporate bases indexed by a single 
number and doubly indexed (wavelet-type) bases. For the purpose of illustration, one 
example for each case is given below. First, a Fourier basis expansion, which emphasizes 
the (spectral) properties of the drift in frequency domain and second, a (Faber-) Schauder 
system which features basis elements with local support. 

2.2.1. Fourier basis 

In this case we set mj = 2j — 1 and the basis functions are defined as 

■01 = 1, ip2k(x) = V% sin(2/c7rx), ^2k+i( x ) — y/2cos(2kiTx), fceN. 

These functions form an orthonormal basis of L 2 [0, 1] and the decay of the Fourier co- 
efficients of a function is related to its regularity. More precisely, if / = J2i>i Qi^i an d 
J2i>i @?l 2lS < oo for /3 > 0, then / has Sobolev regularly (3, i.e. it has square integrable 
weak derivatives up to the order (3. By setting £f ~ Z~ 1_2/3 for (3 > 0, we obtain a prior 
which has a version with a- Holder continuous sample paths for all a < /3. A possible 
choice for the model probabilities is to take them geometric, i.e. p(j) ~ exp(— Crrij) for 
some C > 0. 

Priors of this type are quite common in other statistical settings. See for instance Zhao 
(2000) and Shen and Wasserman (2001), who considered priors of this type in the context 
of the white noise model and nonparametric regression. The prior can be viewed as an 
extension of the one of Papaspiliopoulos et al. (2012) discussed in the introduction. The 
latter uses the same basis functions and decreasing variances with (3 = p — 1/2. It does 
not put a prior on the model index j however (it basically takes j = oo) and uses a fixed 
scaling parameter whereas we put a prior on s. In Section 4 we argue that our approach 
has a number of advantages. 

2.2.2. Schauder functions 

The Schauder basis functions are a location and scale family based on the "hat" function 
A(x) = (2x)lr 0) i)(a;) + 2(x — l)lji -^(x). With rrij = 2 3 ' -1 , the Schauder system is given by 
■01 = 1 and for I > 2 ipi(x) — A; (a; mod 1), where 

Aai-i+fcfr) = Ai^^x-k + l), j>l, k = l,...,2 j ~ l . 
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These functions have compact supports. A Schauder expansion thus emphasizes local 
properties of the sample paths. For (3 £ (0, 1), a function / with Faber-Schauder expansion 

/ = Y.i>i c i^i = c i^i + J2j>i Y%=\ ^-i+fc^-i+fc has Holder regularity of order (3 if and 
only if \ci\ < const, x for every I (see for instance Kashin and Saakyan (1989)). It 
follows that if in our setup we take ^- 1 +k — 2 _/3j for j > 1 and k — 1, . . . , 2 jf_1 , then we 
obtain a prior with regularity /3. A natural choice for p(j) is again p(j) ~ exp(— Crrij). 

The Schauder system is well known in the context of constructions of Brownian motion, 
see for instance Rogers and Williams (2000). The Brownian motion case corresponds to 
prior regularity = 1/2. 

3. The Markov chain Monte Carlo sampler 

3. 1 . Posterior within a fixed model 

When continuous observations x T = (x t : t £ [0,T]) from the diffusion model (1) 
are available, then we have an explicit expression for the likelihood p(x T \ b). Indeed, by 
Girsanov's formula we almost surely have 



Cf. e.g. Liptser and Shiryaev (2001). Note in particular that the log-likelihood is quadratic 
in b. 

Due to the special choices in the construction of our hierarchical prior, the quadratic 
structure implies that within a fixed model j, we can do partly explicit posterior compu- 
tations. More precisely, we can derive the posterior distributions of the scaling constant s 2 
and the vector of coefficients Qi conditional on all the other parameters. The continuous 
observations enter the expressions through the vector ^ £ IR mj and the rrij x rrij matrix 
S J defined by 



p(x T | b) = exp 



(Jo b ^-lJ 




(4) 




(5) 



and 




(6) 



Lemma 1. We have 



V \s 2 }3} x T ~ iv mj ((^)-V '^WT 1 ), 

s 2 \9\j,x T ~ IG(o+ (1/2)7^,6+ (1/2)(0 J ') T (H J ')-V), 



where W j = E j + (s 2 ^)- 1 
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Proof. The computations are straightforward. We note that by Girsanov's formula (4) and 
the definitions of and E- 7 we have 



p(x T \j,e^,s 2 ) = e^ T ^ ej ^ ej (7) 
and by construction of the prior, 

p(9 j \j, s 2 ) oc (^-^-fW^r^ p(s2) K (s2) - a -i e -^ 

It follows that 

p(0>' | s 2 ,j,x T ) oc I j, 9 j ,s 2 )p(9 j \j,s 2 ) oc gCW-^W, 
which proves the first assertion of the lemma. Next we write 
p(s 2 | 6 3 , j, a; T ) oc p(x T \ j, 6 3 , s 2 )p(8 3 | j, s 2 )p(s 2 ) 

which yields the second assertion. □ 

The lemma shows that Gibbs sampling can be used to sample (approximately) from 
the continuous observations posteriors of s 2 and 9^ within a fixed model j. In the next 
subsections we explain how to combine this with reversible jump steps between different 
models and with data augmentation through the generation of diffusion bridges in the case 
of discrete-time observations. 



3.2. Reversible jumps between models 

In this subsection we still assume that we have continuous data x T = (x t : t G [0, T]) 
at our disposal. We complement the within-model computations given by Lemma 1 with 
(a basic version of) reversible jump MCMC (cf. Green (1995)) to explore different models. 
We will construct a Markov chain which has the full posterior p(j, 9\ s 2 \ x T ) as invariant 
distribution and hence can be used to generate approximate draws from the posterior 
distribution of the drift function b. 

We use an auxiliary Markov chain on N with transition probabilities q(j' \j), j,j' G N. 
As the notation suggests, we denote by p(j \ s 2 ,x T ) the conditional (posterior) probability 
of model j given the parameter s 2 and the data x T . Recall that p(j) is the prior probability 
of model j . Now we define the quantities 



B(f\j) 



p{x T | j 1 , s 2 ^ 
p(x T I j, s 2 ) 



R(f 1 i) = p{f)q{j 1 f) (9) 

Note that B(j' \j) is the Bayes factor of model j' relative to model j, for a fixed scale s 2 . 
To simplify the notation, the dependence of this quantity on s 2 and x T is suppressed. 
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The overall structure of the algorithm that we propose is that of a componentwise 
Metropolis-Hastings (MH) sampler. The scale parameter s 2 is taken as component I and 
the pair (j, J ) as component II. Starting with some initial value (j , # J0 , Sq), alternately 
moves from (J, 0\ s 2 ) to (j, 6\ (s') 2 ) and moves from (J, 6* J , s 2 ) to (f, # J , s 2 ) are performed, 
where in each case the other component remains unchanged. 

Updating the first component is done with a simple Gibbs move, that is a new value 
for s 2 is sampled from its posterior distribution described by Lemma 1, given the current 
value of the remaining parameters. 

Move (I). Update the scale. Current state: {j,0^,s 2 ). 

• Sample (s') 2 ~ IG(a + (l/2)m i5 b + {l/2){0j) T {»?)- l 0i). 

• Update the state to {j,6 j , {s') 2 ). 

The second component (J, 0i) has varying dimension and a reversible jump move is 
employed to ensure detailed balance (e.g. Green (2003), Brooks et al. (2011)). To perform 
a transdimensional move, first a new model f is chosen and a sample from the posterior 
for given by Lemma 1 is drawn. 

Move (II). Transdimensional move. Current state: (j,0\s 2 ). 

• Select a new model f with probability q(j' \ j). 

• Sample ~ N m >.((W j ')~ l fi j ' , (W^y 1 ). 

• Compute r = B(f \ j)R(f \ j)- 

• With probability min{l,r} update the state to (f, 3 ' , s 2 ), else leave the state un- 
changed. 

All together we have now constructed a Markov chain Z , Zi, Z 2 , . . . on the transdimen- 
sional space E = [j jeN {j} x M mj x (0, oo), whose evolution can be described as follows. 



Continuous observations algorithm 

Initialization. Set Z = (j , 0™, Sq). 

Transition. Given the current state Zj = (j, 0\ s 2 ): 

• Sample (s') 2 ~ IG(a + (1/2)771^, b + {l/2){0i) T {EP)- 1 0i). 

• Sample f ~ q(j'\j), 

• Sample j ' ~ ^((WO'V', (W^ 1 ). 

• With probability min{l,S(j' | j)R(f \j)}, set 

= (j',^', ( S ') 2 ), else set Z j+1 = (j,0\ (s') 2 ). 
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Note that r = (q(J \ j')/q(j \ f))(p(f \ s 2 ,x T )/p(j' \ s 2 ,x T )), so effectively we perform 
Metropolis-Hastings for updating j. As a consequence, the vector of coefficients Qi needs 
to be drawn only in case the proposed j 1 is accepted. 

The following lemma asserts that the constructed chain indeed has the desired station- 
ary distribution. 

Lemma 2. The Markov chain Z , Z ly . . . has the posterior p(j, Qi ,s 2 \x T ) as invariant 
distribution. 

Proof. By Lemma 1 we have that in move I the chain moves from state [j,6\s 2 ) to 
(j,9\s' 2 ) with probability p(s' 2 \j,9^,x T ). Conditioning shows that we have detailed bal- 
ance for this move, that is, 

p(j, Q\ s 2 | x T )p(s' 2 | j, P, x T ) = p(j, 6 j , s' 2 | x T )p{s 2 | j, 9\ x T ). (10) 

In view of Lemma 1 again, the probability that the chain moves from state (j,0^,s 2 ) 
to (j', 0i ,s 2 ) in move II equals, by construction, 

s 2 ) -+ (j>, V\ s 2 )) = min {l, q ^}f\ V f\ S l ,X l\ }q{j' I iW 1 3\ s\ x T ). 

Now suppose first that the minimum is less than 1. Then using 

p(j, (P , s 2 | x T ) = p{Qi | j, s 2 , x T )p(j | s 2 , x T )p(s 2 | x T ) 

and 

p(f, 9 j ', s 2 1 x T ) = p(6 j ' I j", s 2 , x T )p(f | s 2 , x T )p(s 2 1 x T ) 
it is easily verified that we have the detailed balance relation 

p(j,8i,s 2 \x T )p((j,P,s 2 ) -> (f,9J',s 2 )) =p(j',P',s 2 \x T )p{(j',P',s 2 ) (j,P,s 2 )). 

for move II. The case that the minimum is greater than 1 can be dealt with similarly. 

We conclude that we have detailed balance for both components of our MH sampler. 
Since our algorithm is a variable-at-a-time Metropolis-Hastings algorithm, this implies the 
statement of the lemma (see for example section 1.12.5 of Brooks et al. (2011)). □ 

3. 3. Data augmentation for discrete data 

So far we have been dealing with continuously observed diffusion. Obviously, the phrase 
"continuous data" should be interpreted properly. In practice it means that the frequency 
at which the diffusion is observed is so high that the error that is incurred by approximating 
the quantities (5) and (6) by their empirical counterparts, is negligible. If we only have 
low-frequency, discrete-time observations at our disposal, these approximation errors can 
typically not be ignored however and can introduce undesired biases. In this subsection 
we explain how our algorithm can be extended to accommodate this situation as well. 



9 



We assume now that we only have partial observations xq,xa,---, x n A of our diffusion 
process, for some A > and n G N. We set T = nA. The discrete observations constitute 
a Markov chain, but it is well known that the transition densities of discretely observed 
diffusions and hence the likelihood are not available in closed form in general. This com- 
plicates a Bayesian analysis. An approach that has been proven to be very fruitful, in 
particular in the context of parametric estimation for discretely observed diffusions, is to 
view the continuous diffusion segments between the observations as missing data and to 
treat them as latent (function-valued) variables. Since the continuous-data likelihood is 
known (cf. the preceding subsection), data augmentation methods (see Tanner and Wong 
(1987)) can be used to circumvent the unavailability of the likelihood in this manner. 

As discussed in Van der Meulen and van Zanten (2013) and shown in a practical setting 
by Papaspiliopoulos et al. (2012), the data augmentation approach is not limited to the 
parametric setting and can be used in the present nonparametric problem as well. Prac- 
tically it involves appending an extra step to the algorithm presented in the preceding 
subsection, corresponding to the simulation of the appropriate diffusion bridges. If we 
denote again the continuous observations by x T = (x t : t G [0, T]) and the discrete-time 
observations by xa, ■ ■ ■ , inA, then using the same notation as above we essentially want to 
sample from the conditional distribution 

p(x T | j,9 ] ,s 2 ,x A , . . . ,x nA ). (11) 

Exact simulation methods have been proposed in the literature to accomplish this, e.g. 
Beskos et al. (2006a), Beskos et al. (2006b). For our purposes exact simulation is not 
strictly necessary however and it is more convenient to add a Metropolis-Hastings step 
corresponding to a Markov chain that has the diffusion bridge law given by (11) as sta- 
tionary distribution. 

Underlying the MH sampler for diffusion bridges is the fact that by Girsanov's theorem, 
the conditional distribution of the continuous segment X^ k ' = (X t : t G [(k — l)A,kA]) 
given that X(&_i)a = ^(it-i)A an d X kA = x k A, is equivalent to the distribution of a Brown- 
ian bridge that goes from x^-i)A at time (k — 1)A to x^a at time kA. The corresponding 
Radon-Nikodym derivative is proportional to 

pkA i pkA 

L k {X™ | b) = exp ( / b(X t ) dX t -- b\X t ) dt) . (12) 

V </(fc-l)A Z J(k-l)A ' 

We also note that due to the Markov property of the diffusion, the different diffusion 
bridges X^\ . . . ,X^ n \ can be dealt with independently. 

Concretely, the missing segments x^ = (x t ■ t G ((k — 1)A, kA)), k = 1, . . . , n can be 
added as latent variables to the Markov chain constructed in the preceding subsection, and 
the following move has to be added to moves I and II introduced above. It is a standard 
Metropolis-Hastings step for the conditional law (11), with independent Brownian bridge 
proposals. For more details on this type of MH samplers for diffusion bridges we refer to 
Roberts and Stramer (2001). 

Move (III). Updating the diffusion bridges. Current state: (j, Q\ s 2 , x^\ . . . , x^): 
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For k = 1, . . . , n, sample a Brownian bridge w^ k > from ((k— 1)A, x^-i)a) to (kA, xi-a)- 



• For k = 1, . . . , n, compute = Lk(w^ k ' \ b)/Lk(x^ k > \ b), for 6 = J2i< mj Qfyi- 

• Independently, for k — 1, . . . , n, with probability min{l, r^} update x^-* to w^ k \ else 



Of course the segments a^ 1 **, . . . , x^ can always be concatenated to yield a continuous 
function on [0,T]. In this sense move III can be viewed as a step that generates new, 
artificial continuous data given the discrete-time data. It is convenient to consider this 
whole continuous path on [0, T] as the latent variable. When the new move is combined 
with the ones defined earlier a Markov chain Zq, Z±, Z2, ... is obtained on the space E = 
\J jeN {j} x W 71 ^ x (0,oo) x C[0, T]. Its evolution can be described as follows. 



Discrete observations algorithm 

Initialization. Set Zq = (jo, 0^°, Sq, Xq ), where Xq is for instance 
obtained by linearly interpolating the observed data points. 

Transition. Given the current state Zj = (j, s 2 , x T ), construct fol- 



• Sample {s') 2 ~ IG(a + (l/2)m i; b + (1/2)(^') T (S^)- 1 ^), update s 2 to {s') 2 . 

• Sample / ~ q(j'\j) and ~ ^((W i ')~V i ' J (W- 7 ")" 1 )- 

• With probability min{l,.B(y | j)R(f \j)}, update (j, # J ) to else retain 



to (kA,x kA ) and compute r fe = L k {w^> | b)/L k (x^ \ b), for 6 = 
• Independently, with probability min{l,r/ c }, update x^ to w^ k \ else retain 2^. 



It follows from the fact that move III is a MH step for the conditional law (11) and Lemma 
2 that the new chain has the correct stationary distribution again. 

4. Simulation results 

The implementation of the algorithms presented in the preceding section involves the 
computation of several quantities, including the Bayes factors B(f \j) and sampling from 
the posterior distribution of 0i given j and s 2 . In Section 5 we explain in some detail 
how these issues can be tackled efficiently. In the present section we first investigate the 
performance of our method on simulated data. 

For the drift function, we first choose the function b(x) = 12(a(x) + 0.05) where 



retain jW. 



lows: 



• For k = 1, . . . , n, sample a Brownian bridge w^ k ' from {{k — 1) A, X(k-i)A) 




x e [0,2/3) 
x e [2/3,1] 



(13) 
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Figure 1: Left: drift function. Right: derivative of drift function 



This function is Holder-continuous of order 1.5 on [0, 1]. A plot of b and its derivative is 
shown in Figure 1. Clearly, the derivative is not differentiable in 0, 1/3 and 2/3. 

We simulated a diffusion on the time interval [0, 200] using the Euler discretization 
scheme with time discretization step equal to 10~ 5 . Next, we retained all observations at 
values t = iA with A = 0.001 and % — 0, . . . ,200.000. The data are shown in Figure 2. 
From the histogram we can see that the process spends most time near i = 1/3, so we 
expect estimates for the drift to be best in this region. For now, we consider the data as 
essentially continuous time data, so no data- augment at ion scheme is employed. 

We define a prior with Fourier basis functions as described in Section 2.2.1, choosing 
regularity /3 = 1.5. With this choice, the regularity of the prior matches that of the true 
drift function. 

For the reversible jump algorithm there are a few tuning parameters. For the model 
generating chain, we choose q(j \ j) = 1/2, q(j + 1 | j) = q(j — 1 | j) = 1/4. For the priors 
on the models we chose C = — log(0.95) which means that p(j) oc (0.95)" 1 ? expressing 
weak prior belief in a course (low) level model. For the inverse Gamma prior on the scale 
we take the hyper parameters a = b = 5/2. 

We ran the continuous time algorithm for 3000 cycles and discarded the first 500 it- 
erations as burn-in. We fix this number of iterations and burn-in for all other MCMC 
simulations. In Figure 3 we show the resulting posterior mean (dashed curve) and 90% 
pointwise credible intervals (visualized by the gray bars). The posterior mean was esti- 
mated using Rao-Blackwellization (Robert (2007), section 4.2). (Specifically, the posterior 
mean drift was not computed as the pointwise average of the drift functions b sampled at 
each iteration. Rather, the average of the posterior means (W-? )~ 1 /J? obtained at each 
MCMC iteration (see move II) was used.) 

Insight in the mixing properties of the Markov chain is gained by considering traces of 
the sampled drift function at several fixed points as shown in Figure 4. The trace plots 
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Figure 2: Left: simulated data. Right: histogram of simulated data modulo 1. 

indicate that indeed the first 200-300 iterations should be discarded. Plots of the visited 
models over time and the corresponding acceptance probabilities are shown in figures 5 
and 6 respectively. The mean and median of the scaling parameter s 2 are given by 1.91 
and 1.64 respectively (computing upon first removing burn-in samples). 

To judge the algorithm with respect to the sensitivity of C, we ran the same algorithm 
as well for C = 0. The latter choice is often made and reflects equals prior belief on all 
models under consideration. If C — 0, then the chain spends more time in higher models. 
However, the posterior mean and pointwise credible bands are practically equal to the case 
C = -log(0.95). 

To get a feeling for the sensitivity of the results on the choice of the hyper-parameters 
a and b, the analysis was redone for a = b = 5. The resulting posterior mean and credible 
bands turned out to be indistinguishable from the case a = b = 2.5. 

Clearly, if we would have chosen an example with less data, then the influence of the 
priors would be more strongly pronounced in the results. 

4-1- Effect of the prior on the model index 

If, as in the example thus far, the parameter (3 is chosen to match the regularity of the 
true drift, one would expect that using a prior where the truncation point for in the series 
expansion of the drift is fixed at a high enough level, one would get results comparable to 
those we obtained in Figure 3. If we fix the level at j = 30, this indeed turns out to be the 
case. The main advantage of putting a prior on the truncation level and using a reversible 
jump algorithm is an improvement in computing time. For this example, a simulation run 
for the reversible jump model took about 55% of that for the fixed dimensional model. 
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Figure 3: Left: true drift function (red, solid) and samples from the posterior distribution. Right: drift 
function (red, solid), posterior mean (black, dashed) and 90% pointwise credible bands 
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Figure 4: Trace and running mean of the sampled drift at different design points. The color of the samples 
indicates the current model, cold colors correspond to small values of j. 
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Figure 7: Drift function (red, solid), posterior mean (black, dashed) and 90% pointwise credible bands. 
Left: s 2 = 0.25 fixed. Middle: s 2 = 50 fixed. Right: random scaling parameter (Inverse gamma prior with 
a = b = 2.5). 



For the reversible jump run, the average of the (non-burn-in) samples of the model index 
j equals 18. 

4-2. Effect of the random scaling parameter 

To assess the effect of including a prior on the multiplicative scaling parameter, we ran 
the same simulation as before, though keeping s 2 fixed to either 0.25 (too small) or 50 (too 
large). For both cases, the posterior mean, along with 90% pointwise credible bounds, is 
depicted in Figure 7. For ease of comparison, we added the right-hand-figure of Figure 
3. Clearly, fixing s 2 = 0.25 results in oversmoothing. For s 2 = 50, the credible bands 
are somewhat wider and suggest more fluctuations in the drift function than are actually 
present. 

4-3. Effect of mis specifying smoothness of the prior 

If we consider the prior without truncation, then the smoothness of the prior is es- 
sentially governed by the decay of the variances on the coefficients. This decay is deter- 
mined by the value of f3. Here, we investigate the effect of misspecifying (3. We consider 
(3 G {0.25, 1.5, 3}. In Figure 8 one can assess the difference in posterior mean, scaling pa- 
rameter and models visited for these three values of (3. Naturally, if (3 is too large, higher 
level models and relatively large values for s 2 are chosen to make up for the overly fast 
decay on the variances of the coefficients. Note that the boxplots are for log(s 2 ), not s 2 . 

It is interesting to investigate what happens if the same analysis is done without the 
reversible jump algorithm, thus fixing a high level truncation point. The results are in 
Figure 9. From this figure, it is apparent that if f3 is too small the posterior mean is very 
wiggly. At the other extreme, if (3 is too large, we are oversmoothing and the true drift 
is outside the credible bands except near x = 1/3. As such, misspecifying (3 can result 
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in very bad estimates if a high level is fixed. From the boxplots in Figures 8 and 9 one 
can see that the larger /3, the larger the scaling parameter. Intuitively, this makes sense. 
Moreover, in case we employ a reversible jump algorithm, a too small (large) value for fi 
is compensated for by low (high) level models. 

4-4- Results with Schauder basis 

The complete analysis has also been done for the Schauder basis. Here we take q(j \ 
j) = 0.9, q(j + 1 | j) — q(j — 1 | j) — 0.1. The conclusions are as before. For the main 
example, the computing time with the Schauder basis was approximately 15% of that for 
the Fourier basis. 

4-5. Discrete-time data and data- augmentation 

Here, we thin the "continuous" -time data to discrete-time data by retaining every 50th 
observation from the continuous-time data. The remaining data are hence at times t = iA 
with A = 0.05 and i = 0, . . . ,4000. Next, we use our algorithm both with and without 
data-augmentation. Here, we used the Schauder basis to reduce computing time. The 
results are depicted in Figure 10. 

The leftmost plot clearly illustrates that the discrete-time data with A = 0.5 cannot be 
treated as continuous-time data. The bias is quite apparent. Comparing the middle and 
the rightmost plot shows that data-augmentation works very well in this example. 

We also looked at the effect of varying T (observation time horizon) and A (time 
in between discrete time observations). In all plots of Figure 11 we used the Schauder 
basis with /3 = 1.5 and data augmentation with 49 extra imputed points in between two 
observations. In the upper plots of Figure 11 we varied the observation time horizon while 
keeping A = 0.1 fixed. In the lower plots of Figure 11 we fixed T = 500 and varied A. As 
expected, we see that the size of the credible bands decreases as the amount of information 
in the data grows. 

Lastly, Figure 12 illustrates the influence of increasing the number of augmented ob- 
servations on the mixing of the chain. Here we took A = 0.2 and T = 500 and compare 
trace plots for two different choices of the number of augmented data points, in one case 
25 data points per observation and in the second case 100 data points per observations. 
The mixing does not seem to deteriorate with a higher number of augmented observations. 

4-6. Performance of the method for various drift functions 

In this section we investigate how different features of the drift function influence the 
results of our method. The drift functions chosen for the comparison are 

1. bi(x) = 8sin(47rx), 

2. b 2 (x) = 200x(l - 2x) 3 l [0 i } {x) - ^(1 - x){2x - l) 3 l { i_ A) {x), where x = x mod 1, 

3. 63(0;) = — 8 sin(7r(4x — l))l[i |](x modi). 
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Figure 8: Reversible jump; from left to right f3 = 0.25, (3 = 1.5 and ,9 = 3. Upper figures: posterior 
mean and 90% pointwise credible bands. Middle figure: boxplots of log(s 2 ). Lower figures: histogram of 
model-index. 
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Figure 9: Fixed level; from left to right (3 = 0.25, j3 = 1.5 and = 3. Upper figures: posterior mean and 
90% pointwise credible bands. Lower figure: boxplots of log(s 2 ). 
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Figure 10: Drift function (red, solid), posterior mean (black, dashed) and 90% pointwise credible 
bands. Discrete-time data. Left: without data-augmentation. Middle: with data-augmentation. Right: 
continuous-time data. 





Figure 11: Drift function (red, solid), posterior mean (black, dashed) and 90% pointwise credible bands. 
Upper: Discrete time observations with A = 0.1. From left to right T = 50, T = 200 and T — 500. Lower: 
Discrete time observations with T — 500. From left to right A = 1, A = 0.2 and A = 0.1. (All augmented 
to <5 = 0.002.) 
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Figure 12: Trace plots illustrating the influence data augmentation on the mixing of the chain. 2500 
observations in [0,500]. Top: 25 data points per observation. Bottom: 100 data points per observation. 
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Figure 13: Drift function (red, solid), posterior mean (black, dashed) and 90% pointwise credible bands. 
From left to right: b±, &2i and 63. 



As a prior we took the Fourier basis with parameter (3 = 1.5. For s 2 we took an inverse 
Gamma prior with hyper parameters a = b = 5/2. For each drift function, 2000 observa- 
tions with A = 0.1 were generated. The algorithm was then used with data augmentation 
with 49 imputed points extra in between two observations. The results for b\, 62 and 63 are 
in figure 13. We ran the analysis for the Schauder basis as well, which led to very similar 
results. 

4-7. Comparison with Papaspiliopoulos et al. (2012): Butane Dihedral Angle Time Series 
To compare our approach to that of Papaspiliopoulos et al. (2012) we analyzed the 
butane dihedral angle time series considered by these authors. After some preliminary op- 
erations on the data, these data are assumed to be discrete time observations from a scalar 
diffusion with unit-diffusion coefficient (details on this are described in Papaspiliopoulos 
et al. (2012) and supplementary material to this article). After this preliminary step, the 
time series consists of 4000 observations observed evenly over the time interval [0, 4] (time 
is measured in nanoseconds). The right-hand figure of Figure 14 shows a histogram of the 
discrete time observations. 

Papaspiliopoulos et al. (2012) use a centered Gaussian process prior with precision 
operator (2), with r\ = 0.02, k = and p — 2. This choice for p yields a prior of Holder 
smoothness essentially equal to 1.5. As explained in the introduction, this is essentially 
the law of the random function 

00 ^ 

where ipi are the Fourier basis functions defined in Section 2.2.1 and the Z\ are independent 
standard normal random variables. Note that with rj = 0.02 we have (7r 4 r]) _1 « 0.51. To 
match as closely as possible with their prior specification, we use the Fourier basis with 
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(3 = 1.5, as described in Section 2.2.1. Conditionally on s 2 and j, our prior then equals the 
law of the random function 

2i-i 

a: H> -Z^x). 
i=i 

For s 2 we took an inverse Gamma prior with hyper parameters a = b = 5/2. For this 
choice the prior mean for s 2 equals 0.5 which is close to (jr r])" 1 . 

For the prior on the models we use p(j) oc (0.95) mj . As before, we ran the continu- 
ous time algorithm for 3000 cycles and discarded the first 500 iterations as burn-in. We 
used data augmentation with 99 extra augmented time points in between two successive 
observations. 

The left-hand figure in Figure 14 shows the posterior mean and pointwise 68% credible 
bands, both for our approach and that of Papaspiliopoulos et al. (2012). Overall, the 
posterior means computed by both methods seem to agree very well except for the boundary 
areas. In these areas, the credible bounds are wider, since we have less information about 
the drift here. In Figure 15 histograms of the scaling parameter s 2 and the truncation level 
J are shown. Clearly, s 2 takes values typically around a value as large as 5000, much larger 
than 0.51. This illustrates once again the usefulness of equipping the scaling parameter 
with a prior distribution. The fact that our credible bands are wider near the boundary 
of the observation area seems to indicate that Papaspiliopoulos et al. (2012) are somewhat 
overconfident about the form of the drift function in that area. Their narrower credible 
bands seem to be caused by prior belief rather than information in the data and are not 
corroborated by our more conservative approach. 

5. Numerical considerations 

5.1. Drawing coefficients from the posterior within a fixed model 

For move II the algorithm requires to sample a random vector U ~ 
A 7 m .((W- J ) _1 /i ;? , (W J ) _1 ). In order to do so we first compute the Cholesky decomposi- 
tion of (note that is symmetric and positive definite, ensuring its existence). For 
an upper triangular matrix M J we then have = {M^) T MK Next we let solve the 
system (M J ) T V = y? , draw a standard normal vector Z ~ N m .(0, 1) and construct U by 
backward solving 

M j U = z j + Z. (14) 

It is easily seen that the random vector U has the required distribution. 

Backsolving linear equations with triangular matrices requires 0(m 2 ) operations. 
Cholesky factors are computed in 0(m^) operations in general, but for basis functions 
with local support, S J and are sparse, enabling enabling faster computations. For 
the Schauder basis, the number of non-zero elements of the upper triangular part of 
S- 7 is 2 J_1 (j — 1) + 1, so the fraction of non-zero elements of S- 7 is approximately 
1.00, 1.00, 0.88, 0.66, 0.45, 0.28, 0.17, ... for j = 1, 2, 3, 4, 5, 6, 7, . . . The Cholesky factor of 
a sparse matrix is not necessarily sparse as well. However, the sparsity pattern originating 
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Figure 14: Comparison of the estimate of drift using the Butane Dihedral Angle data. Red solid: A Fourier 
prior with /? = 1.5. Blue dashed: Results of Papaspiliopoulos ct al. (2012). The posterior mean with 68% 
credible bands is pictured. Right: Histogram of the data. 
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from the tree structure of the supports of the Schauder elements enables to specify a perfect 
elimination ordering of the rows and columns of S- 7 (for details we refer to Rose (1970)). 
This means that the Cholesky factor inherits the sparsity. Moreover, the Cholesky factor- 
ization can be computed on the sparse representation of the matrix S- 7 . The particular 
reordering necessary - reversing the order of rows and columns - makes this technique 
applicable for moves within levels. 

5.2. Computation of the Bayes factors 

Our algorithms require the evaluation of the Bayes factors defined by (8). The following 
lemma is instrumental in the numerical evaluation of these numbers. Recall the definitions 
of ^ , S- 7 and in Section 3.1. 

Lemma 3. We have 

/ t \ ■ 2, exp(|(^) r (^)-V) 
plx \J,S ) = . = -. 

Proof. Since 

p(x T | j, s 2 ) = J p(x T I j, d\ s 2 )p{V I j, s 2 )d6> 
we have, by (7) and the definition of the prior, 

1 f fai^T.jl, 



p(x T \j,s 2 ) = , / e^y» 3 -^ eJ y w3d3 dei. (15) 

y/\2TTS 2 Zl\ J 

By completing the square we see that this is further equal to 

1 c jr(M J ) T (^ry f e -\(e ] -{w^^) T w^(e^-{w^^) d Qj 
v/|27rs 2 Hi| J 

' :e i(^) T (^)-Vy| 27r (^)-i|. 



This completes the proof. □ 
As a consequence of Lemma 3, we have 

21og£(j" | j) = (/f^')- 1 // - (^f(W^)-V + log ( 'f^J ). (16) 

We now show how the right- hand- side of the display can be evaluated in a numerical 
efficient and stable way. In the context of Gaussian Markov random fields related tricks 
have been used in Rue et al. (2009). 

Suppose f — j — k > (if k — 0, B(f \ j) = 1 and for k < the calculations are 
similar.) First we compute and the Cholesky decomposition of Wi +k (the matrix 
is symmetric and positive definite, so its Cholesky decomposition exists). We obtain an 
upper triangular matrix M J+ such that 

W j+k = (M j+k ) T M j+k . 

Next we apply the following theorem, taken from Stewart (1998) (cf. Theorem 1.6 therein). 
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Theorem 4. Suppose the matrix A can be portioned as 

A u A 12 
A21 A 2 2 

where An is nonsingular. Then A has a block LU decomposition 



A n A 12 
A21 A 2 2 



Ln 
L 2 i L22 



U 22 ^ 

where Ln and Un are nonsingular. For such decomposition An = LnUn- 

The Cholesky decomposition factor M- 7 of hence equals the upper left block of 
M J+fc , which is obtained by retaining only the first mj rows and columns of Mj + k- Also 
note that the vector jj is obtained from /z J+fc by retaining only the first mj elements. 
Now if z j+k is the solution to (M^ +k ) T z^ +k = ^ +k , then (^ +k ) 7 \W j+k )~ l fj, j+k = 

H j , then z j+k = [z j ,g j+k ] 



\\z : ' +k \\ 2 . If we similarly define as the solution to (M J ) T V 
where g^ +k contains the last rrij + k — rrij elements of z^ +k . Therefore 



(/x' + Y(w i+fc )~V' + * 

Furthermore, 

, / \s 2 W j Ei\ 
log 



(//•'j'ni-') y 



\z j+k \\ 2 -\\z>\\ 2 



\9 



j+k\\2 



The second term on the right-hand side equals 



ri: '^ k \W^\ 

£ lo ^ 2 ) + lo g(r^)- 



2 log 



I API 



\\Mi+ k \J 



-2 £ ^ M U k - 

i=m,j+l 



Therefore, we have 

2logB{j'\j) = \\g 
We can summarize our findings as follows. 



m j+k 



j+k\\2 



2 J2 lo s(<<^; 

i=m,j+l 



j+k 
i.i 



;i7) 



Algorithm to compute the Bayes factor B(f \ j) for f = j + k 



Compute /j, j+k , Z j+k and W j+k . 

Obtain the Cholesky decomposition of W j+k so that W j+k = (M j+k ) T M j+k 
for an upper-triangular matrix AF +fc . 

Solve zi + from {M^ +k ) T z^ +k = ^ +k and partition the solution into 
z j+k _ where dim(z- J ) = m, 

Compute B(f \ j) from (17). 



.1 ■ 
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6. Concluding remarks 

Estimation of diffusion processes has attracted a lot of attention in the past two decades. 
Within the Bayesian setup very few articles have considered the problem of nonparametric 
estimation. In this article we propose an alternative approach to the method detailed in 
Papaspiliopoulos et al. (2012). From the simulations it turns out that our method can 
provide good results. 

The simulation results indicate that the posterior mean can be off the truth if the prior 
specification is inappropriate in the sense that 

• the multiplicative scale s is fixed at a value either too high or too low; 

• a truncation level is fixed and the smoothness of the prior (governed by f3) is chosen 
inappropriately. 

The first of these problems can be circumvented by specifying a prior distribution on the 
scaling parameter. As regards the second problem, endowing the truncation level with 
a prior and employing a reversible jump algorithm, it turns out that reasonable results 
can be obtained if we erroneously undersmooth by choosing the regularity of the prior too 
small. For a fixed high truncation level this is certainly not the case. In case the prior is 
smoother than the true drift function, both reversible jumps and a fixed high-level model 
can give bad results. Overall however, simulation results indicate that our method is more 
robust against prior misspecification. 

It will be of great interest to complement our numerical results with mathematical re- 
sults providing theoretical performance guarantees and giving further insight in limitations 
as well. Another interesting possible extension is to endow the regularity parameter /3 with 
a prior as well and let the data determine its appropriate value. This destroys the partial 
conjugacy however and it is a challenge to devise numerically feasible procedures for this 
approach. 
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